Zurich - 27 & 28 June 2022
Methods for dealing with missing data implicitly assume a missing data mechanism.
MCAR (missing completely at random): the strictest assumption. In practice, MCAR data are also the easiest to deal with.
MAR (missing at random): a less strict assumption. Most advanced missing data methods assume this mechanism (e.g. multiple imputation, FIML).
MNAR (missing not at random): the least strict assumption.
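The three mechanisms can be illustrated with a small simulation. This is a minimal Python sketch under stated assumptions: the variables `x` and `y`, the 30% missingness rate and the logistic coefficients are all hypothetical and are not taken from the course data. It only shows how the probability of missingness is generated under each mechanism and how the mean of the observed `y` values responds.

```python
import math
import random

random.seed(1)

n = 10_000
# Hypothetical data: x is always observed, y depends on x and may go missing.
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]


def logistic(z):
    """Map a linear predictor to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))


def mean_observed(values, missing):
    """Mean of the values that are not flagged as missing."""
    kept = [v for v, m in zip(values, missing) if not m]
    return sum(kept) / len(kept)


# MCAR: every y has the same chance of being missing.
mcar = [random.random() < 0.3 for _ in range(n)]
# MAR: the chance of missing y depends only on the observed x.
mar = [random.random() < logistic(-1 + 1.5 * xi) for xi in x]
# MNAR: the chance of missing y depends on the value of y itself.
mnar = [random.random() < logistic(-1 + 1.5 * yi) for yi in y]

full_mean = sum(y) / n
print("full data:", round(full_mean, 3))
print("MCAR     :", round(mean_observed(y, mcar), 3))
print("MAR      :", round(mean_observed(y, mar), 3))
print("MNAR     :", round(mean_observed(y, mnar), 3))
```

In this setup the observed mean stays close to the full-data mean under MCAR and shifts downward under MAR and MNAR, because high values of `y` are more likely to be missing. Under MAR the shift can in principle be corrected using `x`; under MNAR it cannot be corrected from the observed data alone.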
Complete-case analysis (CCA): only the cases with observed data for all variables involved are used in the analysis.
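As an illustration of what complete-case analysis does to the data, here is a minimal Python sketch. The records and the helper `complete_cases` are hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"weight": 82.0, "performance": 7.5},
    {"weight": None, "performance": 9.1},
    {"weight": 75.5, "performance": None},
    {"weight": 90.2, "performance": 6.8},
]


def complete_cases(records, variables):
    """Keep only the records with observed values for all listed variables."""
    return [r for r in records if all(r[v] is not None for v in variables)]


cca = complete_cases(rows, ["weight", "performance"])
print(len(rows), "records,", len(cca), "complete cases")
```

Note that the set of complete cases depends on which variables are involved in the analysis: with only `performance` in the model, three of the four records would be kept.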
Parameter estimates:
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.347 | 0.205 | 1.69 | 0.0935 |
| X2 | 0.225 | 0.0922 | 2.44 | 0.0167 |

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.418 | 0.325 | 1.29 | 0.204 |
| X2 | 0.245 | 0.170 | 1.44 | 0.157 |

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.347 | 0.205 | 1.69 | 0.0935 |
| X2 | 0.225 | 0.0922 | 2.44 | 0.0167 |

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.0302 | 0.335 | -0.0900 | 0.929 |
| X2 | 0.0942 | 0.150 | 0.627 | 0.534 |
Imputation: replacing each missing value with a substitute value.
Possible values to impute?
Other examples:
| gender | n | wgt n | wgt mean | wgt sd | prf n | prf mean | prf sd |
|---|---|---|---|---|---|---|---|
| Man | 81 | 73 | 86.009 | 9.660 | 73 | 7.998 | 3.539 |
| Woman | 61 | 50 | 75.782 | 8.478 | 52 | 9.782 | 3.238 |
Parameter estimates:
Conditional mean imputation estimates the imputed value as the predicted value from a regression model: \(Y_{imp} = \beta_0 + \beta_1 * X\)
Imputation from regression equation: \(Performance = \beta_0 + \beta_1 * weight + \beta_2 * gender\)
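The idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `weight` and `performance` values are hypothetical, only one predictor is used (gender is left out to keep the example short), and `fit_line` is a hand-written least-squares helper rather than a library function.

```python
from statistics import mean

# Hypothetical data: weight is fully observed, performance has missing values.
weight = [70.0, 75.0, 80.0, 85.0, 90.0, 95.0]
performance = [10.0, 9.0, None, 7.0, None, 5.0]


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1


# Fit the regression model on the cases with observed performance.
observed = [(w, p) for w, p in zip(weight, performance) if p is not None]
b0, b1 = fit_line([w for w, _ in observed], [p for _, p in observed])

# Replace each missing performance score with its predicted value.
imputed = [p if p is not None else b0 + b1 * w
           for w, p in zip(weight, performance)]
print([round(v, 2) for v in imputed])
```

Every imputed value lies exactly on the regression line, which is why this method understates the variability of the imputed variable.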
Stochastic regression imputation: regression imputation with random error added to the predicted value:
The added error is drawn from a normal distribution.
Imputation uncertainty is not taken into account, because each missing value is still imputed only once.
Imputation from regression equation: \(Performance = \beta_0 + \beta_1 * weight + \beta_2 * gender + \epsilon\)
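A minimal Python sketch of stochastic regression imputation follows. The data are hypothetical, only `weight` is used as a predictor, and the residual standard deviation `sigma` is assumed to be estimated from the observed cases as the square root of the residual sum of squares divided by the residual degrees of freedom.

```python
import random
from statistics import mean

random.seed(42)

# Hypothetical data: weight is fully observed, performance has missing values.
weight = [70.0, 75.0, 80.0, 85.0, 90.0, 95.0, 100.0]
performance = [10.4, 8.7, None, 7.3, None, 4.9, 4.2]

x_obs = [w for w, p in zip(weight, performance) if p is not None]
y_obs = [p for p in performance if p is not None]

# Least-squares fit of performance on weight, using observed cases only.
mx, my = mean(x_obs), mean(y_obs)
b1 = (sum((x - mx) * (y - my) for x, y in zip(x_obs, y_obs))
      / sum((x - mx) ** 2 for x in x_obs))
b0 = my - b1 * mx

# Residual standard deviation (two estimated coefficients, so n - 2 df).
residuals = [y - (b0 + b1 * x) for x, y in zip(x_obs, y_obs)]
sigma = (sum(r ** 2 for r in residuals) / (len(x_obs) - 2)) ** 0.5

# Predicted value plus a normally distributed error term (epsilon).
imputed = [p if p is not None else b0 + b1 * w + random.gauss(0, sigma)
           for w, p in zip(weight, performance)]
print([round(v, 2) for v in imputed])
```

The imputed values now scatter around the regression line instead of lying on it, but the regression coefficients themselves are still treated as if they were known exactly.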
Last observation carried forward (LOCF): an ad hoc method for longitudinal data that imputes each missing value with the previous observed value.
Assumes that people who drop out of the study remain stable.
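The carry-forward rule itself is simple, as this minimal Python sketch shows. The function `locf` and the example scores are hypothetical; `None` is assumed to mark a missed measurement.

```python
def locf(values):
    """Replace each None with the most recent observed value, if there is one."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Hypothetical scores at months 0, 3, 6, 12 and 24 for one participant
# who dropped out after the second measurement.
scores = [12.0, 10.5, None, None, None]
print(locf(scores))  # [12.0, 10.5, 10.5, 10.5, 10.5]
```

A value that is missing before any observation has been made stays missing, because there is nothing to carry forward.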
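Missing data pattern (1 = observed, 0 = missing). The first column counts the cases with each pattern, the last column counts the missing variables per pattern, and the bottom row counts the missing values per variable.

| cases | id | group | 0 | 3 | 6 | 12 | 24 | missing |
|---|---|---|---|---|---|---|---|---|
| 76 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 16 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 9 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 2 |
| 7 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 3 |
| 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 |
| missing | 0 | 0 | 0 | 2 | 9 | 18 | 34 | 63 |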
The trajectories over time, with LOCF imputations in red.
Below only the cases with missing observations, separated by group.
mice (more on the imputation methods later)
Imputation model
Auxiliary variables: variables related to the probability of missing data or to the variable with missing data.
Each imputed dataset is analyzed with the substantive analysis model.
This results in \(m\) sets of results.
Workflow:
| term | estimate | std.error | statistic | p.value | nobs |
|---|---|---|---|---|---|
| (Intercept) | 19.5 | 5.70 | 3.42 | 0.000792 | 153 |
| Solar.R | 0.122 | 0.0275 | 4.44 | 0.0000169 | 153 |
| (Intercept) | 21.5 | 5.65 | 3.80 | 0.000210 | 153 |
| Solar.R | 0.107 | 0.0273 | 3.93 | 0.000127 | 153 |
| (Intercept) | 21.4 | 5.53 | 3.87 | 0.000162 | 153 |
| Solar.R | 0.106 | 0.0271 | 3.90 | 0.000142 | 153 |
| (Intercept) | 23.1 | 5.85 | 3.95 | 0.000120 | 153 |
| Solar.R | 0.110 | 0.0283 | 3.89 | 0.000149 | 153 |
| (Intercept) | 22.3 | 5.80 | 3.84 | 0.000177 | 153 |
| Solar.R | 0.117 | 0.0281 | 4.17 | 0.0000513 | 153 |
Class: mipo, m = 5

| term | m | estimate | ubar | b | t | dfcom | df | riv | lambda | fmi |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 5 | 21.5600906 | 3.257524e+01 | 1.803273e+00 | 3.473916e+01 | 151 | 123.0708 | 0.06642860 | 0.06229072 | 0.07716663 |
| Solar.R | 5 | 0.1124339 | 7.638285e-04 | 4.843572e-05 | 8.219513e-04 | 151 | 118.0594 | 0.07609413 | 0.07071327 | 0.08606584 |
Pooling of point estimates that are normally distributed over the imputed datasets.
Means, standard deviations, regression estimates, linear predictors, proportions.
To pool point estimates, take the mean over the \(m\) imputed datasets:
\(\hat\theta = \frac{1}{m}\sum^m_{i=1}{\theta_i}\)
To pool the variance (or standard error) of the estimate, combine the within-imputation and between-imputation variance.
Between variance:
\(\sigma^2_{between} = \frac{\sum^m_{i=1}(\theta_i - \hat\theta)^2}{m-1}\)
Within variance:
\(\sigma^2_{within} = \frac{\sum^m_{i=1}\sigma^2_i}{m}\)
Total variance:
\(\sigma^2_{total} = \sigma^2_{within} + \sigma^2_{between} + \frac{\sigma^2_{between}}{m}\)
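As a worked example, the pooling rules can be applied to the five Solar.R estimates and standard errors from the per-imputation results shown earlier. This is a minimal Python sketch; the helper `pool` is hypothetical, and the inputs are the rounded values as printed, so the results agree with the pooled output only up to rounding.

```python
from statistics import mean, variance

# Solar.R estimates and standard errors from the m = 5 analyses,
# rounded as printed in the per-imputation results.
estimates = [0.122, 0.107, 0.106, 0.110, 0.117]
std_errors = [0.0275, 0.0273, 0.0271, 0.0283, 0.0281]


def pool(estimates, std_errors):
    """Pool m estimates: mean estimate, within, between and total variance."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])
    between = variance(estimates)  # sample variance, divides by m - 1
    total = within + between + between / m
    return pooled, within, between, total


pooled, within, between, total = pool(estimates, std_errors)
print("pooled estimate :", round(pooled, 4))
print("within variance :", within)
print("between variance:", between)
print("total variance  :", total)
print("pooled std.error:", total ** 0.5)
```

The pooled estimate and total variance correspond to the `estimate` and `t` columns of the pooled output, and the within and between variances to `ubar` and `b`.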
In the imputation phase, imputed values are estimated using an imputation method.
method = norm.predict
method = norm.nob
method = norm
method = pmm

| variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ozone | 1 | 116 | 42.129310 | 32.987884 | 31.5 | 37.797872 | 25.94550 | 1.0 | 168.0 | 167 | 1.2098656 | 1.1122431 | 3.0628482 |
| Solar.R | 2 | 146 | 185.931507 | 90.058422 | 205.0 | 190.338983 | 98.59290 | 7.0 | 334.0 | 327 | -0.4192893 | -1.0040581 | 7.4532881 |
| Wind | 3 | 153 | 9.957516 | 3.523001 | 9.7 | 9.869919 | 3.40998 | 1.7 | 20.7 | 19 | 0.3410275 | 0.0288647 | 0.2848178 |
| Temp | 4 | 153 | 77.882353 | 9.465270 | 79.0 | 78.284553 | 8.89560 | 56.0 | 97.0 | 41 | -0.3705073 | -0.4628929 | 0.7652217 |
| Month | 5 | 153 | 6.993464 | 1.416522 | 7.0 | 6.991870 | 1.48260 | 5.0 | 9.0 | 4 | -0.0023448 | -1.3167465 | 0.1145191 |
| Day | 6 | 153 | 15.803922 | 8.864520 | 16.0 | 15.804878 | 11.86080 | 1.0 | 31.0 | 30 | 0.0026001 | -1.2224406 | 0.7166540 |
method = norm.predict
Imputed value: \(Y_{imp} = \hat{\beta}_0 + X_{mis}\hat{\beta}_1\)
Parameters \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are estimated from the observed data. No residual error is added.
method = norm.nob
Imputed value: \(Y_{imp} = \hat{\beta}_0 + X_{mis}\hat{\beta}_1 + \epsilon\)
Parameters \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are estimated from the observed data.
\(\epsilon\) is normally distributed residual error.
method = norm
Imputed value: \(Y_{imp} = \dot{\beta}_0 + X_{mis}\dot{\beta}_1 + \epsilon\)
Parameters \(\dot{\beta}_0\) and \(\dot{\beta}_1\) are drawn from their posterior distribution.
\(\epsilon\) is normally distributed residual error.
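The difference between estimated and drawn parameters can be made concrete with a minimal Python sketch of Bayesian linear regression imputation with one predictor. The data are hypothetical, and the draws assume a standard noninformative prior: the residual variance is drawn as the residual sum of squares divided by a chi-square draw, and the coefficients are drawn from a normal distribution centred on the least-squares estimates. This illustrates the general technique and is not necessarily identical to what mice does internally.

```python
import math
import random

random.seed(2022)

# Hypothetical data: x is fully observed, y is observed for six cases only.
x_obs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_obs = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
x_mis = [2.5, 5.5]  # predictor values of the cases with missing y

n = len(x_obs)
x_bar = sum(x_obs) / n
y_bar = sum(y_obs) / n
sxx = sum((x - x_bar) ** 2 for x in x_obs)
sum_x2 = sum(x ** 2 for x in x_obs)

# Least-squares estimates (beta-hat) from the observed data.
b1_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_obs, y_obs)) / sxx
b0_hat = y_bar - b1_hat * x_bar
rss = sum((y - b0_hat - b1_hat * x) ** 2 for x, y in zip(x_obs, y_obs))

# Draw the residual SD: RSS divided by a chi-square draw with n - 2 df.
chi2 = random.gammavariate((n - 2) / 2, 2)
sigma_dot = math.sqrt(rss / chi2)

# Draw the coefficients (beta-dot): beta-hat plus sigma_dot times a normal
# deviate with covariance (X'X)^-1, built from its 2 x 2 Cholesky factor.
v00, v01 = sum_x2 / (n * sxx), -x_bar / sxx
l00 = math.sqrt(v00)
l10 = v01 / l00
l11 = math.sqrt(1 / sxx - l10 ** 2)
z0, z1 = random.gauss(0, 1), random.gauss(0, 1)
b0_dot = b0_hat + sigma_dot * l00 * z0
b1_dot = b1_hat + sigma_dot * (l10 * z0 + l11 * z1)

# Imputed values: prediction from the drawn coefficients plus residual error.
y_imp = [b0_dot + b1_dot * x + random.gauss(0, sigma_dot) for x in x_mis]
print("estimated:", round(b0_hat, 3), round(b1_hat, 3))
print("drawn    :", round(b0_dot, 3), round(b1_dot, 3))
print("imputed  :", [round(v, 2) for v in y_imp])
```

Repeating the draw for each of the \(m\) imputed datasets gives different coefficients every time, which is how the uncertainty about the regression parameters is carried into the imputations.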
method = logreg
method = polyreg
method = cart

method = logreg
Imputed value: \(\log\frac{P(Y_{mis})}{1-P(Y_{mis})} = \dot{\beta}_0 + X_{mis}\dot{\beta}_1\)
Parameters \(\dot{\beta}_0\) and \(\dot{\beta}_1\) are drawn from their posterior distribution.
The imputed category is drawn from a Bernoulli distribution with the predicted probability, so no normally distributed residual error is added.
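One common way to turn the predicted log-odds into an imputed binary value is to convert them to a probability and draw from a Bernoulli distribution. The following minimal Python sketch assumes that approach; the coefficients `beta0_dot` and `beta1_dot` are hypothetical values standing in for a single draw from the posterior distribution, and `x_mis` holds hypothetical predictor values.

```python
import math
import random

random.seed(7)

# Hypothetical coefficients, standing in for one draw from the posterior.
beta0_dot, beta1_dot = -0.5, 1.2
# Hypothetical predictor values for the cases with a missing binary outcome.
x_mis = [-2.0, -0.5, 0.0, 0.8, 2.5]


def inv_logit(z):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-z))


# Predicted probability that the missing outcome equals 1.
probs = [inv_logit(beta0_dot + beta1_dot * x) for x in x_mis]

# Bernoulli draw: impute 1 with the predicted probability, otherwise 0.
y_imp = [1 if random.random() < p else 0 for p in probs]
print([round(p, 3) for p in probs])
print(y_imp)
```

Drawing from the predicted probability, rather than always imputing the most likely category, keeps the random variation in the imputed outcome.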
method = polyreg
Fit a multinomial regression model.
Parameters are drawn from their posterior distribution (Bayesian).
Compute the predicted category.
Add normally distributed residual error to account for sampling variance.